System and Method to Predict and Prevent Power Supply Failures based on Data Center Environmental Behavior

ABSTRACT

An information handling system may include a first power supply for a first system, a second power supply for a second system, and a management controller. The management controller may detect that the first power supply has failed, receive first information from the first system related to the operation of the first power supply prior to the failure of the first power supply, receive second information from the second system associated with the second power supply, and determine a probability of failure of the second power supply based upon a comparison of the first information with the second information.

CROSS REFERENCE TO RELATED APPLICATION

Related subject matter is contained in co-pending U.S. patent application Ser. No. 15/______ (DC-110997) entitled “System and Method to Prevent Power Supply Failures based on Data Center Environmental Behavior,” filed of even date herewith, the disclosure of which is hereby incorporated by reference.

FIELD OF THE DISCLOSURE

This disclosure generally relates to information handling systems, and more particularly relates to a system and method to predict and prevent power supply failures based on data center environmental behavior.

BACKGROUND

As the value and use of information continues to increase, individuals and businesses seek additional ways to process and store information. One option is an information handling system. An information handling system generally processes, compiles, stores, and/or communicates information or data for business, personal, or other purposes. Because technology and information handling needs and requirements may vary between different applications, information handling systems may also vary regarding what information is handled, how the information is handled, how much information is processed, stored, or communicated, and how quickly and efficiently the information may be processed, stored, or communicated. The variations in information handling systems allow for information handling systems to be general or configured for a specific user or specific use such as financial transaction processing, reservations, enterprise data storage, or global communications. In addition, information handling systems may include a variety of hardware and software resources that may be configured to process, store, and communicate information and may include one or more computer systems, data storage systems, and networking systems.

SUMMARY

An information handling system may include a first power supply for a first system, a second power supply for a second system, and a management controller. The management controller may detect that the first power supply has failed, receive first information from the first system related to the operation of the first power supply prior to the failure of the first power supply, receive second information from the second system associated with the second power supply, and determine a probability of failure of the second power supply based upon a comparison of the first information with the second information.

BRIEF DESCRIPTION OF THE DRAWINGS

It will be appreciated that for simplicity and clarity of illustration, elements illustrated in the Figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements are exaggerated relative to other elements. Embodiments incorporating teachings of the present disclosure are shown and described with respect to the drawings presented herein, in which:

FIG. 1 is a block diagram of a data center according to an embodiment of the present disclosure;

FIG. 2 is a block diagram of a server rack according to an embodiment of the present disclosure;

FIG. 3 is a flowchart illustrating a method for predicting power supply failures in a data center according to an embodiment of the present disclosure;

FIG. 4 is a flowchart illustrating a method for minimizing power supply failures in systems with redundant power supplies according to an embodiment of the present disclosure; and

FIG. 5 is a block diagram illustrating a generalized information handling system according to an embodiment of the present disclosure.

The use of the same reference symbols in different drawings indicates similar or identical items.

DETAILED DESCRIPTION OF DRAWINGS

The following description in combination with the Figures is provided to assist in understanding the teachings disclosed herein. The following discussion will focus on specific implementations and embodiments of the teachings. This focus is provided to assist in describing the teachings, and should not be interpreted as a limitation on the scope or applicability of the teachings. However, other teachings can certainly be used in this application. The teachings can also be used in other applications, and with several different types of architectures, such as distributed computing architectures, client/server architectures, or middleware server architectures and associated resources.

FIG. 1 illustrates an embodiment of a data center 100 including server aisle 105, and a data center management controller (DCMC) 160. Server aisle 105 includes server racks 110, 120, 130, 140, and 150. Server racks 120, 130, 140, and 150 include various equipment that operates to perform the data storage and processing functions of data center 100. As such, one or more elements of data center 100, including the equipment in server racks 120, 130, 140, and 150 and DCMC 160 can be realized as an information handling system. Each of server racks 110, 120, 130, 140, and 150 includes a respective rack management controller (RMC) 112, 122, 132, 142, and 152. RMCs 112, 122, 132, 142, and 152 represent service processors that operate in accordance with an Intelligent Platform Management Interface (IPMI) functional implementation to provide monitoring, management, and maintenance functionality to the respective server racks 110, 120, 130, 140, and 150, and to the equipment therein. Examples of RMCs 112, 122, 132, 142, and 152 can include service processors, such as baseboard management controllers (BMCs), an Integrated Dell Remote Access Controller (iDRAC), another service processor, or a combination thereof. RMCs 112, 122, 132, 142, and 152, and DCMC 160 are connected together with various service processors associated with the equipment in server racks 110, 120, 130, 140, and 150 into a management network whereby the DCMC can remotely manage the equipment in the server racks. Server racks 110, 120, 130, 140, and 150 each include one or more power supplies to provide regulated and monitored power to the equipment within the server racks, as described further below.

FIG. 2 illustrates a server rack 200 typical of server racks 110, 120, 130, 140, and 150. Server rack 200 includes a rack space that represents a standard server rack, such as a 19-inch rack equipment mounting frame or a 23-inch rack equipment mounting frame, and includes six rack units. The rack units represent special divisions of the rack space that are a standardized unit of, for example, 1.75 inches high. For example, a piece of equipment that will fit into one of the rack units shall herein be referred to as a 1-U piece of equipment, another piece of equipment that takes up two of the rack units is commonly referred to as a 2-U piece of equipment, and so forth. As such, the rack units are numbered sequentially from the bottom to the top as 1U, 2U, 3U, 4U, 5U, and 6U. The skilled artisan will recognize that other configurations for rack units can be utilized, and that a greater or lesser number of rack units in a server rack may be utilized, as needed or desired. For example, a rack unit can be defined by the Electronic Components Industry Association standards council.

Server rack 200 further includes a rack management controller 210 and a rack management switch 220, and is illustrated as being populated with two 2-U servers 230 and 240, and with two 1-U servers 250 and 260. 2-U server 230 is installed in rack spaces 1U and 2U, 2-U server 240 is installed in rack spaces 3U and 4U, 1-U server 250 is installed in rack space 5U, and 1-U server 260 is installed in rack space 6U. Rack management controller 210 includes network connections 212 and 214, and rack switch 220 includes network connections 222, 223, 224, 225, and 226. As illustrated, rack management controller 210 is connected via network connection 214 to a management network that includes a DCMC 280 similar to DCMC 160, and is connected via network connection 212 to network connection 222 of rack switch 220 to extend the management network to servers 230, 240, 250, and 260. As such, server 230 includes a BMC 231 that is connected to network connection 223 via a network connection 232, server 240 includes a BMC 241 that is connected to network connection 224 via a network connection 242, server 250 includes a BMC 251 that is connected to network connection 225 via a network connection 252, and server 260 includes a BMC 261 that is connected to network connection 226 via a network connection 262. Here, the management network includes RMC 210, BMCs 231, 241, 251, and 261, and DCMC 280. DCMC 280 is configured to initiate management transactions with RMC 210 to monitor, manage, and maintain elements of server rack 200, of management switch 220, and of servers 230, 240, 250, and 260 via respective BMCs 231, 241, 251, and 261.

Server rack 200 further includes a power distribution unit (PDU) 270. PDU 270 operates to provide AC power to receptacles 271, 272, 273, 274, 275, and 276 from a power distribution network of data center 100. Each of receptacles 271-276 is associated with a rack unit of the server rack. Thus rack unit 1U is associated with receptacle 271, rack unit 2U is associated with receptacle 272, rack unit 3U is associated with receptacle 273, rack unit 4U is associated with receptacle 274, rack unit 5U is associated with receptacle 275, and rack unit 6U is associated with receptacle 276. Server 230 includes a pair of power supplies 234 and 236, server 240 includes a pair of power supplies 244 and 246, server 250 includes a power supply 254, and server 260 includes a power supply 264. Power supplies 234, 236, 244, 246, 254, and 264 each operate to receive AC power from the power distribution network, to convert and regulate the power from the AC voltage level to various DC voltage levels as used by respective servers 230, 240, 250, and 260, to provide operational and status information related to the power usage and health of the various DC voltage rails provided, and to provide other control and operational settings for the various DC voltage rails. Power supply 234 is connected to receive AC power from receptacle 271, power supply 236 is connected to receive AC power from receptacle 272, power supply 244 is connected to receive AC power from receptacle 273, power supply 246 is connected to receive AC power from receptacle 274, power supply 254 is connected to receive AC power from receptacle 275, and power supply 264 is connected to receive AC power from receptacle 276.

Power supplies 234 and 236 operate as redundant power supplies for server 230. That is, each of power supplies 234 and 236 is configured to provide the full operating power of server 230 without the need to operate the other power supply. As such, power supplies 234 and 236 are operated as redundant power supplies, such that when one of the power supplies fails, the other power supply is seamlessly brought on line to take over powering server 230. Thus, power supplies 234 and 236 are typically configured as hot-swappable power supplies, such that when one power supply fails, an indication is provided to BMC 231 and a service technician can be dispatched to replace the failing power supply. Similarly, each of power supplies 244 and 246 is configured to provide the full operating power of server 240 without the need to operate the other power supply, and the two operate as redundant power supplies.

BMC 231 operates to monitor, manage, and maintain server 230. In monitoring server 230, BMC 231 accesses various sensors for monitoring various physical characteristics of the server, such as temperature sensors, fan speed sensors, voltage sensors on the various voltage rails, current sensors on the voltage rails, humidity sensors, and the like, to characterize the environmental conditions within which the server is operating. BMC 231 further accesses various state information of the elements of server 230, such as by accessing state information from power supplies 234 and 236, processor, memory, or I/O state information related to the operating condition of the elements of the server, and the like, to further characterize the environmental conditions within the elements of the server. BMC 231 further accesses various software and firmware information, such as processor loading information, memory and storage utilization information, network and I/O bandwidth information, and the like, to characterize the processing conditions of server 230. BMC 231 further operates to provide the monitoring information to RMC 210 and to DCMC 280 via the management network, as needed or desired.

In managing server 230, BMC 231 utilizes the monitoring information from the server to provide inputs to control various physical operations of the server, such as fan speeds, voltage levels, processor clock rates, I/O speeds, and the like, to ensure that the environmental conditions of the server and the elements thereof remain within acceptable limits. BMC 231 further operates to provide indications as to the environmental and processing conditions to RMC 210 and DCMC 280, as needed or desired. For example, when temperature conditions within server 230 exceed a particular threshold, then BMC 231 can provide a high-temp indication to that effect to RMC 210 and the RMC can direct a heating/ventilation/air conditioning (HVAC) system of the data center to direct additional cooling to server 230. Similarly, when temperature conditions within server 230 drop below another threshold, then BMC 231 can provide a low-temp indication to that effect to RMC 210 and the RMC can direct the HVAC system to direct less cooling to server 230.

In managing server 230, BMC 231 further utilizes the monitoring information from the server to provide inputs to control various processing conditions of the server. For example, when processing conditions, such as processor loading or memory utilization, within server 230 exceed a particular threshold, then BMC 231 can provide a high-utilization indication to that effect to DCMC 280, and the DCMC can direct spare servers of the data center to come on line to offload the workload from the server. Similarly, when processing conditions within server 230 drop below another threshold, then BMC 231 can provide a low-utilization indication to that effect to DCMC 280 and the DCMC can initiate steps to shut down server 230 and place it into a pool of spare servers for the data center.

In maintaining server 230, BMC 231 operates to ensure that the operating thresholds, set points, and other limits are maintained, and to reset or change the thresholds, set points, and other limits as needed or desired. Further, BMC 231 operates to maintain software and firmware in server 230, as needed or desired. BMCs 241, 251, and 261 operate to monitor, manage, and maintain respective servers 240, 250, and 260 similarly to the way that BMC 231 operates on server 230, as described above. Moreover, RMC 210 operates to monitor, manage, and maintain functions and features of server rack 200, as needed or desired. Further, RMC 210 operates as a central node of the management network between BMCs 231, 241, 251, and 261, and DCMC 280, collecting monitoring information from the BMCs and providing management and maintenance information to the BMCs, as needed or desired.

DCMC 280 includes a failure predictor 282. Failure predictor 282 operates to receive the monitoring information from RMC 210 and from BMCs 231, 241, 251, and 261, to evaluate the monitoring information as it relates to past failures of the elements of server rack 200 and of servers 230, 240, 250, and 260, and to provide a prediction of the likelihood that one or more of the elements of the server rack and the servers will experience a failure. When the likelihood of a failure of the one or more elements of server rack 200 and servers 230, 240, 250, and 260 grows too great, DCMC 280 operates to provide indications of the impending failure, and to take steps to mitigate the consequences of such a failure. In general, failure predictor 282 operates to detect when a failure occurs, to log the monitoring information from RMC 210 and BMCs 231, 241, 251, and 261, to correlate the failure to particular components of the monitoring information, to detect when the correlated components of the monitoring information are in a similar condition as existed when the failure occurred, and to provide an indication and take steps to mitigate the failure. Moreover, each time a new failure for a similar component occurs, failure predictor 282 operates to refine the prediction mechanism by more closely correlating the new failures with the particular conditions at the time of the new failures.

In a particular embodiment, failure predictor 282 operates to predict when one of power supplies 234, 236, 244, 246, 254, and 264 is likely to fail, based upon the monitoring information derived from RMC 210 and BMCs 231, 241, 251, and 261 at the time of a previous failure of one of the power supplies. For example, when a first one of power supplies 234, 236, 244, 246, 254, and 264 fails, failure predictor 282 operates to receive monitoring information from BMCs 231, 241, 251, and 261. The monitoring information includes server rack and server environmental information, processing condition information, and other monitoring information, as needed or desired. For example, failure predictor 282 can direct BMCs 231, 241, 251, and 261 and RMC 210 to provide the environmental information for servers 230, 240, 250, and 260, and for server rack 200. In addition, failure predictor 282 can direct the BMC 231, 241, 251, or 261 associated with the failing power supply to provide a make and model of the failing power supply 234, 236, 244, 246, 254, or 264, the configuration settings for the failing power supply, a predicted failure rate for the failing power supply, an age of the failing power supply when it failed, an age differential between redundant power supplies, a power usage of the failing power supply, a power source for the failing power supply, or the like, as needed or desired.

The predicted failure rate can include a rate generated by failure predictor 282 based upon historical failure rates in the data center, or can be provided by a manufacturer of the failing power supply. The age of the power supply can be in terms of a chronological age, or in terms of a watt-hour usage age of the power supply. The power usage of the failing power supply can be provided in terms of an efficiency at which the power supply converts voltages from the AC supply voltage to the various DC supply rails. For example, where a particular power supply experiences a rapid loss in efficiency, this may be a suitable indicator for the impending failure of additional power supplies. The power source for a failing power supply may be a given power bus bar of the data center. For example, a typical data center may be provided with 3-phase AC power and each phase may be utilized to power a separate bus bar, and the various power supplies of a data center may be provided with source power from different bus bars.
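As an illustration of the efficiency indicator described above, the following is a minimal sketch, assuming efficiency is taken as DC output power divided by AC input power and that a "rapid loss" means a drop larger than some tolerance across a window of recent samples. The function names, the `max_drop` parameter, and its default of 0.05 are hypothetical and are not taken from the disclosure.

```python
def efficiency(dc_output_watts, ac_input_watts):
    """Conversion efficiency as DC output power over AC input power (assumed definition)."""
    if ac_input_watts <= 0:
        raise ValueError("AC input power must be positive")
    return dc_output_watts / ac_input_watts


def rapid_efficiency_loss(samples, max_drop=0.05):
    """Return True when efficiency fell by more than max_drop across the window.

    samples: efficiency readings in chronological order (oldest first).
    max_drop: hypothetical tolerance; a real deployment would tune it.
    """
    if len(samples) < 2:
        return False
    return samples[0] - samples[-1] > max_drop
```

For instance, readings of 0.92, 0.91, and 0.84 would be flagged under the default tolerance, while 0.92, 0.91, and 0.90 would not.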

Once a first power supply has failed, failure predictor 282 operates to evaluate each other power supply to determine if the operating condition of the other power supplies matches the conditions present at the time of failure. In particular, failure predictor 282 can ascribe a probability to the likelihood that a particular power supply will fail as:

P_fail = f₁(Hardware) * f₂(Efficiency) * f₃(Age) * f₄(Environment)  Eq. 1

where P_fail is the probability that a particular power supply will fail (a number between 0 and 1, with 0 indicating a low probability of failure and 1 indicating a high probability of failure), f₁(Hardware) is a factor that relates the make, model, and manufacturer of the evaluated power supply to the failed power supply, f₂(Efficiency) is a factor that relates the power usage and the power source of the evaluated power supply to the failed power supply, f₃(Age) is a factor that relates the age of the evaluated power supply to the failed power supply, and f₄(Environment) is a factor that relates the environmental conditions at the time of the failure of the failed power supply to the current environmental conditions.

An example of the hardware factor may be such that where the evaluated power supply is a same make and model as the failed power supply, the hardware factor is set to 1, where the make is the same but they are different models, the hardware factor is set to 0.5, and where they are of different makes, the hardware factor is set to 0.0 (or to a non-zero minimum, such as 0.01). An example of the efficiency factor may be such that when the power factor of the evaluated power supply is greater than that of the failed power supply at the time of failure, the efficiency factor is 1, and when the power factors are substantially equal, the efficiency factor is 0.50, and when the evaluated power supply has a lower power factor, then the efficiency factor is 0.25. Similarly, when the evaluated power supply is older than the failed power supply when it failed, the age factor is set to 1.0, when the evaluated power supply is substantially as old, then the age factor is set to 0.05, and when the evaluated power supply is younger, then the age factor is set to 0.10. The coefficients f₁, f₂, f₃, and f₄ are selected to provide a relative weight between the factors, such that f₁+f₂+f₃+f₄=1. For example, it may be determined that an increase in the power factor is most indicative of a failure, followed by age, hardware, and environment. As such f₂ may be set to 0.50, f₃ may be set to 0.25, f₁ may be set to 0.15, and f₄ may be set to 0.10.
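The following is a minimal sketch of one possible reading of Equation 1, not a definitive implementation. It assumes that the coefficients f₁ through f₄ act as weights applied to the individual factor values and that the weighted terms are added, which keeps the result between 0 and 1 when the weights sum to 1. Equation 1 as printed joins its terms with "*", so the additive combination is an assumption made here for illustration. The names `WEIGHTS`, `hardware_factor`, and `failure_probability` are hypothetical.

```python
# Example weights from the text: efficiency 0.50, age 0.25,
# hardware 0.15, environment 0.10 (they sum to 1).
WEIGHTS = {"hardware": 0.15, "efficiency": 0.50, "age": 0.25, "environment": 0.10}


def hardware_factor(eval_make, eval_model, failed_make, failed_model, floor=0.0):
    """Hardware factor per the example values in the text.

    1.0 for same make and model, 0.5 for same make but different model,
    and the floor (0.0, or a non-zero minimum such as 0.01) for different makes.
    """
    if eval_make != failed_make:
        return floor
    return 1.0 if eval_model == failed_model else 0.5


def failure_probability(factors, weights=WEIGHTS):
    """Weighted sum of factor values (ASSUMPTION: additive reading of Eq. 1).

    factors: mapping with a value in [0, 1] for each key in weights.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * factors[name] for name in weights)
```

Under these assumptions, an evaluated supply with a hardware factor of 0.5, an efficiency factor of 0.5, an age factor of 1.0, and an environment factor of 0.0 would score 0.15*0.5 + 0.50*0.5 + 0.25*1.0 + 0.10*0.0 = 0.575.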

Failure predictor 282 further operates to establish failure thresholds that provide a level-based indication of the likelihood that a particular power supply will fail. For example, when P_fail=0.0-0.3, the likelihood of failure can be deemed to be low, indicating no action is needed, when P_fail=0.3-0.6, the likelihood of failure can be deemed to be medium, indicating that the power supply should be monitored, when P_fail=0.6-0.8, the likelihood of failure can be deemed to be high, indicating that the power supply should be monitored carefully and alternatives prepared, and when P_fail=0.8-1.0, the likelihood of failure can be deemed to be very high, indicating that failure is imminent and that action should be taken to replace the evaluated power supply.
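The threshold bands above can be sketched as a simple lookup. The text gives ranges whose endpoints overlap (for example, 0.3 closes one band and opens the next), so this sketch assumes each boundary value belongs to the higher band. That choice, and the function name `classify_failure_likelihood`, are illustrative assumptions.

```python
def classify_failure_likelihood(p_fail):
    """Map a failure probability to the (level, suggested response) bands in the text.

    ASSUMPTION: a value exactly on a boundary falls into the higher band.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if p_fail < 0.3:
        return ("low", "no action needed")
    if p_fail < 0.6:
        return ("medium", "monitor the power supply")
    if p_fail < 0.8:
        return ("high", "monitor carefully and prepare alternatives")
    return ("very high", "replace the power supply")
```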

Finally, when the evaluated power supply is a redundant power supply, such as power supplies 234 and 236, or power supplies 244 and 246, then failure predictor 282 further operates to provide a command to switch active power supplies when the likelihood of failure becomes very high. For example, if power supply 234 has a likelihood of failure of 0.9, then failure predictor 282 can direct BMC 231 to switch power supply 234 to the standby mode and to switch power supply 236 to the active mode. Note that a failure predictor similar to failure predictor 282 can be implemented in RMC 210, or in one or more of BMCs 231, 241, 251, and 261. In this way, failures of power supplies in a data center are avoided, resulting in improved reliability and operational efficiency of the data center.

FIG. 3 illustrates a method for predicting power supply failures in a data center, starting at block 300. A decision is made as to whether a power supply in a data center has failed in decision block 302. If not, the “NO” branch of decision block 302 is taken and the method loops back to decision block 302 until a power supply in the data center has failed. When a power supply in the data center has failed, the “YES” branch of decision block 302 is taken and the management information for the data center related to the failing power supply is recorded in block 304. A first non-failing power supply is selected in block 306, and the management information for the data center related to the first non-failing power supply is recorded in block 308. A failure probability for the first non-failing power supply is determined in block 310. For example, a failure probability can be determined in accordance with Equation 1, as described above. A decision is made as to whether the failure probability for the first non-failing power supply is higher than a failure threshold in decision block 312. If not, the “NO” branch of decision block 312 is taken and the method returns to block 306 where a next non-failing power supply is selected. If the failure probability for the first non-failing power supply is higher than the failure threshold, the “YES” branch of decision block 312 is taken, an indication of the high likelihood of failure of the first non-failing power supply is given, and if the first non-failing power supply is a redundant power supply, the first non-failing power supply is swapped to the standby state and the associated redundant power supply is switched online in block 314, and the method returns to block 306 where a next non-failing power supply is selected.
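The evaluation loop of FIG. 3 (blocks 306 through 314) might be sketched as follows. This is an illustration under stated assumptions: the `PowerSupply` record, the `evaluate_after_failure` function, the action tuples, and the default threshold of 0.8 are all hypothetical, and the probability calculation is passed in as a callable rather than fixed. The sketch makes a single pass over the non-failing supplies, whereas the flowchart as described keeps returning to block 306.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PowerSupply:
    name: str
    redundant_partner: Optional[str] = None  # name of the paired supply, if any


def evaluate_after_failure(failed_info, supplies, probability, threshold=0.8):
    """Evaluate each non-failing supply once after a failure has been recorded.

    failed_info: management information recorded for the failed supply (block 304).
    supplies: the non-failing supplies to evaluate (blocks 306/308).
    probability: callable(failed_info, supply) -> float in [0, 1] (block 310).
    Returns a list of action tuples for the caller to carry out.
    """
    actions = []
    for ps in supplies:
        p_fail = probability(failed_info, ps)
        if p_fail > threshold:  # decision block 312, "YES" branch
            actions.append(("warn", ps.name))
            if ps.redundant_partner is not None:  # block 314
                actions.append(("swap_to_standby", ps.name, ps.redundant_partner))
    return actions
```

Returning a list of actions instead of issuing commands directly keeps the sketch free of any particular management-network API.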

In a particular embodiment, where a server, a server rack, or a data center aisle includes one or more redundant power supplies, an element of a management network, such as a BMC, an RMC, or a DCMC operates to periodically test the standby power supply to ensure that the standby power supply remains reliably ready to supply power to the server, the server rack, or the data center aisle in the event of the failure of the primary power supply. In particular, the BMC, the RMC, or the DCMC operates to switch the standby power supply into an online mode and to switch the main power supply into a standby mode. The BMC, the RMC, or the DCMC then operates to monitor the performance of the standby power supply while online to determine if the standby power supply is operating within acceptable limits, in terms of power efficiency, temperature, voltage stability, and the like. If the standby power supply is not operating within the acceptable limits, the BMC, the RMC, or the DCMC operates to switch the standby power supply back to the standby mode, to switch the main power supply back online, and to provide an indication that the standby power supply is not operating within the acceptable limits. If the standby power supply is operating within the acceptable limits, the BMC, the RMC, or the DCMC can continue to operate the standby power supply in the online mode, to designate the standby power supply as the main power supply, and to designate the main power supply as the standby power supply, or the BMC, the RMC, or the DCMC can switch the main power supply back to the online mode and switch the standby power supply back to the standby mode.

As an example of this embodiment, consider server 230 with redundant power supplies 234 and 236. Here, power supply 234 can be designated as the main power supply and power supply 236 can be designated as the standby power supply. BMC 231 determines an interval to periodically test power supply 236 to ensure that the standby power supply remains reliably ready to supply power to server 230 in the event of the failure of the power supply 234. A testing interval can be determined as:

Interval = f(PS_age_differential, PS_drain) * Standard Interval  Eq. 2

where PS_age_differential is a factor that determines the difference in age between the main power supply and the standby power supply, PS_drain is a factor that evaluates the amount of usage the standby power supply has experienced, and Standard Interval is a predetermined interval for testing standby power supplies. For example, PS_age_differential can be factored such that if the standby power supply is older than the main power supply, the Interval is reduced from the Standard Interval, and such that if the standby power supply is newer than the main power supply, the Interval is equal to the Standard Interval. Further, PS_drain can be factored such that as the power supplied by the standby power supply increases, the Interval is reduced from the Standard Interval. Finally, the Standard Interval can be selected as needed or desired, such as a one-day interval, a one-week interval, a monthly interval, or the like.
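Equation 2 leaves the form of f unspecified. The sketch below assumes one possible shape: a product of two scaling terms, each at most 1, so that the Interval never exceeds the Standard Interval. The specific formulas, the function name `standby_test_interval`, and the parameters `age_scale_days` and `drain_scale_wh` are hypothetical choices for illustration only.

```python
from datetime import timedelta


def standby_test_interval(
    standby_age_days,
    main_age_days,
    standby_drain_wh,
    standard_interval=timedelta(weeks=1),
    age_scale_days=365.0,   # hypothetical: one extra year of age halves the interval
    drain_scale_wh=1000.0,  # hypothetical: 1000 Wh of standby usage halves the interval
):
    """Scale the standard test interval down for older or more heavily used standby supplies."""
    age_diff = standby_age_days - main_age_days
    # Standby newer than (or the same age as) main: no reduction, per the text.
    age_term = 1.0 if age_diff <= 0 else 1.0 / (1.0 + age_diff / age_scale_days)
    # More power supplied by the standby unit shortens the interval.
    drain_term = 1.0 / (1.0 + standby_drain_wh / drain_scale_wh)
    return standard_interval * (age_term * drain_term)
```

With these assumed scales, a standby supply one year older than the main supply and with no recorded drain would be tested every 3.5 days when the Standard Interval is one week.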

When BMC 231 determines that the Interval has been reached, the BMC operates to switch power supply 236 into an online mode and to switch power supply 234 into a standby mode. BMC 231 then monitors the performance of power supply 236 while in the online mode to determine if the power supply is operating within acceptable limits, in terms of power efficiency, temperature, voltage stability, and the like. If power supply 236 is not operating within the acceptable limits, BMC 231 switches power supply 236 back to the standby mode, switches power supply 234 back into the online mode, and provides an indication via the management network that power supply 236 is not operating within the acceptable limits. If power supply 236 is operating within the acceptable limits, then BMC 231 determines whether to continue to operate power supply 236 in the online mode and swap the designation of main power supply from power supply 234 to power supply 236, or whether to switch power supply 234 back to the online mode and switch power supply 236 back to the standby mode.

FIG. 4 illustrates a method of minimizing power supply failures in systems with redundant power supplies, starting at block 400. A decision is made as to whether power supplies in a data center are hot spare enabled in decision block 402. If not, the “NO” branch of decision block 402 is taken and the method ends in block 416. If the power supplies are hot spare enabled, the “YES” branch of decision block 402 is taken and the redundant power supplies in the data center are determined in block 404. A management controller determines an interval for testing the redundant power supplies and waits for the interval to elapse in block 406. When the interval has elapsed, the management controller switches the main power supply to the standby mode and switches the standby power supply to the online mode in block 408.

A decision is made as to whether the power output from the standby power supply is good in decision block 410. If not, the “NO” branch of decision block 410 is taken and an indication is provided by the management controller that the standby power supply is exhibiting poor health in block 412. The management controller switches the main power supply to the online mode and switches the standby power supply to the standby mode in block 414 and the method ends in block 416. Returning to decision block 410, if the power output from the standby power supply is good, the “YES” branch is taken and a decision is made as to whether the power output from the standby power supply remains stable for a test duration (T) in decision block 418. If not, the “NO” branch of decision block 418 is taken, and an indication is provided by the management controller that the standby power supply is exhibiting poor health in block 412. An example of the test duration (T) can be based upon an amount of time needed by the standby power supply to stabilize operations after being switched to the online mode, such as one (1) second, five (5) seconds, ten (10) seconds, or the like, as needed or desired. The management controller switches the main power supply to the online mode and switches the standby power supply to the standby mode in block 414 and the method ends in block 416.

Returning to decision block 418, if the power output from the standby power supply remains stable for the test duration (T), the “YES” branch is taken and a decision is made as to whether changes to the power supply sources are permissible in decision block 420. For example, where both the main and the standby power supplies are powered by a common bus bar, it may be advisable to swap the standby power supply into the role of main power supply in order to evenly utilize the pair of power supplies. On the other hand, where the main and standby power supplies are powered by different bus bars, it may be advisable to retain the original designations due to bus bar loading concerns. If changes to the power supply sources are not permissible, the “NO” branch of decision block 420 is taken, the management controller switches the main power supply to the online mode and switches the standby power supply to the standby mode in block 414, and the method ends in block 416. If changes to the power supply sources are permissible, the “YES” branch of decision block 420 is taken, the main power supply is designated as the standby power supply and the standby power supply is designated as the main power supply in block 422, and the method ends in block 416.
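The decision flow of FIG. 4, from block 402 through block 422, can be summarized as a small function over the four decisions. This is a sketch only: the step labels and the function name `standby_test_steps` are hypothetical, the interval wait of block 406 is omitted, and the outcomes of the health checks are passed in as booleans rather than measured.

```python
SWITCH_TO_STANDBY_UNIT = "block 408: main -> standby mode, standby -> online mode"
REPORT_POOR_HEALTH = "block 412: report standby power supply in poor health"
RESTORE_ORIGINAL = "block 414: main -> online mode, standby -> standby mode"
SWAP_DESIGNATIONS = "block 422: swap main and standby designations"


def standby_test_steps(hot_spare_enabled, output_good, stable_for_t, swap_permissible):
    """Return the ordered steps the management controller would take for one test cycle."""
    steps = []
    if not hot_spare_enabled:  # decision block 402, "NO" branch
        return steps
    steps.append(SWITCH_TO_STANDBY_UNIT)
    if not output_good or not stable_for_t:  # decision blocks 410 and 418
        steps.append(REPORT_POOR_HEALTH)
        steps.append(RESTORE_ORIGINAL)
        return steps
    if swap_permissible:  # decision block 420
        steps.append(SWAP_DESIGNATIONS)
    else:
        steps.append(RESTORE_ORIGINAL)
    return steps
```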

FIG. 5 illustrates a generalized embodiment of information handling system 500. For purposes of this disclosure, information handling system 500 can include any instrumentality or aggregate of instrumentalities operable to compute, classify, process, transmit, receive, retrieve, originate, switch, store, display, manifest, detect, record, reproduce, handle, or utilize any form of information, intelligence, or data for business, scientific, control, entertainment, or other purposes. For example, information handling system 500 can be a personal computer, a laptop computer, a smart phone, a tablet device or other consumer electronic device, a network server, a network storage device, a switch router or other network communication device, or any other suitable device and may vary in size, shape, performance, functionality, and price. Further, information handling system 500 can include processing resources for executing machine-executable code, such as a central processing unit (CPU), a programmable logic array (PLA), an embedded device such as a System-on-a-Chip (SoC), or other control logic hardware. Information handling system 500 can also include one or more computer-readable media for storing machine-executable code, such as software or data. Additional components of information handling system 500 can include one or more storage devices that can store machine-executable code, one or more communications ports for communicating with external devices, and various input and output (I/O) devices, such as a keyboard, a mouse, and a video display. Information handling system 500 can also include one or more buses operable to transmit information between the various hardware components.

Information handling system 500 can include devices or modules that embody one or more of the devices or modules described above, and can operate to perform one or more of the methods described above. Information handling system 500 includes processors 502 and 504, a chipset 510, a memory 520, a graphics interface 530, a basic input and output system/extensible firmware interface (BIOS/EFI) module 540, a disk controller 550, a disk emulator 560, an input/output (I/O) interface 570, and a network interface 580. Processor 502 is connected to chipset 510 via processor interface 506, and processor 504 is connected to the chipset via processor interface 508. Memory 520 is connected to chipset 510 via a memory bus 522. Graphics interface 530 is connected to chipset 510 via a graphics interface 532, and provides a video display output 536 to a video display 534. In a particular embodiment, information handling system 500 includes separate memories that are dedicated to each of processors 502 and 504 via separate memory interfaces. An example of memory 520 includes random access memory (RAM) such as static RAM (SRAM), dynamic RAM (DRAM), non-volatile RAM (NV-RAM), or the like, read only memory (ROM), another type of memory, or a combination thereof.

BIOS/EFI module 540, disk controller 550, and I/O interface 570 are connected to chipset 510 via an I/O channel 512. An example of I/O channel 512 includes a Peripheral Component Interconnect (PCI) interface, a PCI-Extended (PCI-X) interface, a high-speed PCI-Express (PCIe) interface, another industry standard or proprietary communication interface, or a combination thereof. Chipset 510 can also include one or more other I/O interfaces, including an Industry Standard Architecture (ISA) interface, a Small Computer System Interface (SCSI) interface, an Inter-Integrated Circuit (I²C) interface, a System Packet Interface (SPI), a Universal Serial Bus (USB), another interface, or a combination thereof. BIOS/EFI module 540 includes BIOS/EFI code operable to detect resources within information handling system 500, to provide drivers for the resources, to initialize the resources, and to access the resources.

Disk controller 550 includes a disk interface 552 that connects the disk controller to a hard disk drive (HDD) 554, to an optical disk drive (ODD) 556, and to disk emulator 560. An example of disk interface 552 includes an Integrated Drive Electronics (IDE) interface, an Advanced Technology Attachment (ATA) such as a parallel ATA (PATA) interface or a serial ATA (SATA) interface, a SCSI interface, a USB interface, a proprietary interface, or a combination thereof. Disk emulator 560 permits a solid-state drive 564 to be connected to information handling system 500 via an external interface 562. An example of external interface 562 includes a USB interface, an IEEE 1394 (FireWire) interface, a proprietary interface, or a combination thereof. Alternatively, solid-state drive 564 can be disposed within information handling system 500.

I/O interface 570 includes a peripheral interface 572 that connects the I/O interface to an add-on resource 574, to a TPM 576, and to network interface 580. Peripheral interface 572 can be the same type of interface as I/O channel 512, or can be a different type of interface. As such, I/O interface 570 extends the capacity of I/O channel 512 when peripheral interface 572 and the I/O channel are of the same type, and the I/O interface translates information from a format suitable to the I/O channel to a format suitable to peripheral interface 572 when they are of a different type. Add-on resource 574 can include a data storage system, an additional graphics interface, a network interface card (NIC), a sound/video processing card, another add-on resource, or a combination thereof. Add-on resource 574 can be on a main circuit board, on a separate circuit board or add-in card disposed within information handling system 500, a device that is external to the information handling system, or a combination thereof.

Network interface 580 represents a NIC disposed within information handling system 500, on a main circuit board of the information handling system, integrated onto another component such as chipset 510, in another suitable location, or a combination thereof. Network interface 580 includes network channels 582 and 584 that provide interfaces to devices that are external to information handling system 500. In a particular embodiment, network channels 582 and 584 are of a different type than peripheral interface 572 and network interface 580 translates information from a format suitable to the peripheral interface to a format suitable to external devices. An example of network channels 582 and 584 includes InfiniBand channels, Fibre Channel channels, Gigabit Ethernet channels, proprietary channel architectures, or a combination thereof. Network channels 582 and 584 can be connected to external network resources (not illustrated). The network resource can include another information handling system, a data storage system, another network, a grid management system, another suitable resource, or a combination thereof.

While the computer-readable medium is shown to be a single medium, the term “computer-readable medium” includes a single medium or multiple media, such as a centralized or distributed database, and/or associated caches and servers that store one or more sets of instructions. The term “computer-readable medium” shall also include any medium that is capable of storing, encoding, or carrying a set of instructions for execution by a processor or that causes a computer system to perform any one or more of the methods or operations disclosed herein.

In a particular non-limiting, exemplary embodiment, the computer-readable medium can include a solid-state memory such as a memory card or other package that houses one or more non-volatile read-only memories. Further, the computer-readable medium can be a random access memory or other volatile re-writable memory. Additionally, the computer-readable medium can include a magneto-optical or optical medium, such as a disk or tapes or other storage device to store information received via carrier wave signals such as a signal communicated over a transmission medium. Furthermore, a computer-readable medium can store information received from distributed network resources such as from a cloud-based environment. A digital file attachment to an e-mail or other self-contained information archive or set of archives may be considered a distribution medium that is equivalent to a tangible storage medium. Accordingly, the disclosure is considered to include any one or more of a computer-readable medium or a distribution medium and other equivalents and successor media, in which data or instructions may be stored.

When referred to as a “device,” a “module,” or the like, the embodiments described herein can be configured as hardware. For example, a portion of an information handling system device may be hardware such as, for example, an integrated circuit (such as an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a structured ASIC, or a device embedded on a larger chip), a card (such as a Peripheral Component Interconnect (PCI) card, a PCI-Express card, a Personal Computer Memory Card International Association (PCMCIA) card, or other such expansion card), or a system (such as a motherboard, a system-on-a-chip (SoC), or a stand-alone device).

The device or module can include software, including firmware embedded at a processor or software capable of operating a relevant environment of the information handling system. The device or module can also include a combination of the foregoing examples of hardware or software. Note that an information handling system can include an integrated circuit or a board-level product having portions thereof that can also be any combination of hardware and software.

Devices, modules, resources, or programs that are in communication with one another need not be in continuous communication with each other, unless expressly specified otherwise. In addition, devices, modules, resources, or programs that are in communication with one another can communicate directly or indirectly through one or more intermediaries.

Although only a few exemplary embodiments have been described in detail herein, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of the embodiments of the present disclosure. Accordingly, all such modifications are intended to be included within the scope of the embodiments of the present disclosure as defined in the following claims. In the claims, means-plus-function clauses are intended to cover the structures described herein as performing the recited function and not only structural equivalents, but also equivalent structures.

What is claimed is:
 1. An information handling system, comprising: a first power supply for a first system; a second power supply for a second system; and a management controller configured to: detect that the first power supply has failed; receive first information from the first system related to operation of the first power supply prior to the failure of the first power supply; receive second information from the second system associated with the second power supply; and determine a probability of failure of the second power supply based upon a comparison of the first information with the second information.
 2. The information handling system of claim 1, wherein the management controller is further configured to: determine that the probability of failure is greater than a threshold.
 3. The information handling system of claim 2, wherein the management controller is further configured to: provide an indication when the probability of failure is greater than the threshold.
 4. The information handling system of claim 3, further comprising: a third power supply for the second system, the third power supply operating as a backup to the second power supply; wherein the management controller is further configured to direct the second system to disable the second power supply and to enable the third power supply when the probability of failure is greater than the threshold.
 5. The information handling system of claim 1, wherein the first information includes first hardware information for the first power supply, first efficiency information for the first power supply at the time of failure, first age information of the first power supply at the time of failure, and first environmental information for the physical environment of the first system at the time of failure.
 6. The information handling system of claim 5, wherein the second information includes second hardware information for the second power supply, second efficiency information for the second power supply, second age information of the second power supply, and second environmental information for the physical environment of the second system.
 7. The information handling system of claim 6, wherein the probability of failure is based upon comparisons of the first hardware information with the second hardware information, the first efficiency information with the second efficiency information, the first age information with the second age information, and the first environmental information with the second environmental information.
 8. The information handling system of claim 1, wherein the management controller comprises a Baseboard Management Controller.
 9. A method, comprising: detecting, by a management controller, that a first power supply for a first system has failed; receiving first information from the first system related to the operation of the first power supply prior to the failure of the first power supply; receiving second information from a second system that includes a second power supply; and determining a probability of failure of the second power supply based upon a comparison of the first information with the second information.
 10. The method of claim 9, further comprising: determining that the probability of failure is greater than a threshold.
 11. The method of claim 10, further comprising: providing an indication when the probability of failure is greater than the threshold.
 12. The method of claim 11, further comprising: directing the second system to disable the second power supply and to enable a third power supply of the second system when the probability of failure is greater than the threshold, the third power supply operating as a backup to the second power supply.
 13. The method of claim 9, wherein the first information includes first hardware information for the first power supply, first efficiency information for the first power supply at the time of failure, first age information of the first power supply at the time of failure, and first environmental information for the physical environment of the first system at the time of failure.
 14. The method of claim 13, wherein the second information includes second hardware information for the second power supply, second efficiency information for the second power supply, second age information of the second power supply, and second environmental information for the physical environment of the second system.
 15. The method of claim 14, wherein in determining the probability of failure, the method further comprises: comparing the first hardware information with the second hardware information, the first efficiency information with the second efficiency information, the first age information with the second age information, and the first environmental information with the second environmental information.
 16. The method of claim 9, wherein the management controller comprises a Baseboard Management Controller.
 17. An information handling system, comprising: a first power supply for a first system; a second power supply for a second system; a third power supply for the second system, the third power supply operating as a backup to the second power supply; and a management controller configured to: detect that the first power supply has failed; receive first information from the first system related to operation of the first power supply prior to the failure of the first power supply; receive second information from the second system associated with the second power supply; determine a probability of failure of the second power supply based upon a comparison of the first information with the second information; and direct the second system to disable the second power supply and to enable the third power supply when the probability of failure is greater than a threshold.
 18. The information handling system of claim 17, wherein the first information includes first hardware information for the first power supply, first efficiency information for the first power supply at the time of failure, first age information of the first power supply at the time of failure, and first environmental information for the physical environment of the first system at the time of failure.
 19. The information handling system of claim 18, wherein the second information includes second hardware information for the second power supply, second efficiency information for the second power supply, second age information of the second power supply, and second environmental information for the physical environment of the second system.
 20. The information handling system of claim 19, wherein the probability of failure is based upon comparisons of the first hardware information with the second hardware information, the first efficiency information with the second efficiency information, the first age information with the second age information, and the first environmental information with the second environmental information.
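
By way of illustration only, and without limiting the claims, the comparison of first information with second information and the subsequent threshold test could be sketched in Python as follows. The disclosure does not specify a formula, so everything here is an assumption: the `SupplyInfo` fields, the linear `_closeness` measure, the per-attribute scales, the equal weighting, and the 0.8 threshold are hypothetical choices, and the result is a similarity score in the range 0 to 1 standing in for a probability of failure rather than a calibrated probability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SupplyInfo:
    """Hypothetical container for the four kinds of information compared."""
    hardware_model: str   # hardware information
    efficiency: float     # efficiency information, as a fraction 0..1
    age_hours: float      # age information
    inlet_temp_c: float   # one stand-in for environmental information


def _closeness(a: float, b: float, scale: float) -> float:
    """Return 1.0 when a == b, falling linearly to 0.0 at a distance of `scale`."""
    return max(0.0, 1.0 - abs(a - b) / scale)


def failure_likelihood(failed: SupplyInfo, candidate: SupplyInfo) -> float:
    """Score how closely a running supply resembles a supply that has failed.

    The scales and the equal weighting are illustrative assumptions only.
    """
    scores = [
        1.0 if failed.hardware_model == candidate.hardware_model else 0.0,
        _closeness(failed.efficiency, candidate.efficiency, scale=0.10),
        _closeness(failed.age_hours, candidate.age_hours, scale=20_000.0),
        _closeness(failed.inlet_temp_c, candidate.inlet_temp_c, scale=15.0),
    ]
    return sum(scores) / len(scores)


def should_fail_over(likelihood: float, threshold: float = 0.8) -> bool:
    """True when the score is greater than the (assumed) threshold."""
    return likelihood > threshold
```

Under these assumptions, a management controller could evaluate `failure_likelihood` for each running power supply after a failure is detected and, where `should_fail_over` returns true, provide an indication or direct the affected system to enable its backup power supply.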